ABSTRACT

As part of a larger project to assess changes in
student learning resulting from school reform, this study equates
levels 6 through 14 of the mathematics and reading comprehension
components of Form 7 of the Iowa Tests of Basic Skills (ITBS) with
levels 7 through 14 of the mathematics and reading comprehension
components of the CPS90 (another version of the ITBS), using a Rasch
analysis. The analysis results in the common calibration of all 1,031
mathematics items found in the 17 levels of the two test forms to
define a mathematics variable and all 602 reading items to define a
reading variable. Each item in each subject obtains a person-free
calibration (in logits) of its own level of difficulty on one common
scale linking all items of that subject. The 17 levels of the two
tests were successfully equated so that a person taking the CPS90 or
Form 7 (or a combination of items from the forms targeted at his or
her ability level) will obtain statistically equivalent measures of
ability. Logit measures give a more accurate picture of student rate
of growth than do grade equivalents, with rates of growth highest at
the lower grades and decreasing in the higher grades. Four tables, 13
figures, and 6 references are included. An appendix lists the
criterion definitions of variables. (SLD)
Reading and Mathematics Equating Study

Introduction



This study is part of a larger project intended to assess school change in
student learning as a result of school reform. In order to do this we want to
look at improvements in students' academic achievement over time. Current policy
of the Chicago Public Schools is to change the form of the ITBS each year.
Enough anomalies appeared after the first change in form to prompt
questions about the adequacy of equating by grade equivalents, at least as
applied to Chicago schools. Unless these test forms are equated, it is not
possible to compare student performances from year to year to determine school
change. The Easton, Bean, and Bryk paper (1991) points out that earlier studies
(Frank and Seltzer, 1990) using longitudinal data bases had shown the inadequacy
of the grade equivalent scores for determining growth. Schulz, Shen and Wright
(1990) point out that the construction of the grade equivalent metric is such
that students show an average annual gain of one grade equivalent irrespective
of their actual changes in ability. The incorporation of time into grade
equivalents removes the possibility of determining growth rates.

This study equates levels 6 through 14 of the Mathematics and Reading
Comprehension components of the Iowa Tests of Basic Skills (ITBS Form 7) with
levels 7 through 14 of the Mathematics and Reading Comprehension components of
the CPS90 (another version of the ITBS), using Rasch analysis (Wright & Douglas,
1975; Wright, 1977; Wright & Stone, 1979). The analysis results in the
common calibration of all 1031 mathematics items found in the 17 levels of the
two test forms to define a math variable, and all 602 reading items to define a
reading variable. Each item in each subject obtains a person-free calibration
(in logits) of its own level of difficulty on the one common scale linking all
items of that subject.
Design and Method



Test linking in this study was done with common persons and common items. The
design is in Figure 1. Each arrow represents a group of persons taking a pair
of tests. The initial design took into consideration the need [...] number of
students [...] respective forms. Level 14 of Form 7 shares 67% of its items with
Level 13. Linking was strengthened by adding existing data for Levels 10 and 12
of both forms and Level 14 of Form 7. These data are from the regular student
testing, from schools used in the study. Table 1 lists the number of items and
number of students used in the analysis, for each of the test levels.



The Calibration Matrices 



Figure 1 Equating Study Design

Table 1 Number of Items and Persons Used in the Analysis, by Test Level

                      Mathematics                        Reading
                 Form 7          CPS90           Form 7          CPS90
Test          Items Persons   Items Persons   Items Persons   Items Persons
1. Level 6      3?    245       -      -       79    218       -      -
2. Level 7      81    459      82    ???       66    466      56    ???
3. Level 8      88    502      ?6    ?50       67    566      61    544
4. Level 9      90    282      8?    ?80       44    299      44    453
5. Level 10     99    196      95    205       49    227      49    2?6
6. Level 11    109    156     101    157       54    177      ?4    209
7. Level 12    114    196     109    198        ?      ?      56    2??
8. Level 13    117    ???     113    178       57    ?29      ?7    151
9. Level 14    121    288     117    191       58    2?9      58    179

Note: ? marks a digit that is not legible in the source.

The data were cleaned in four stages: (1) Only response strings marked valid, as
defined by standard Chicago Public Schools' procedures*, were included^; (2)
Response strings showing series of zeroes and/or the same response for 25% or
more of the total number of items were dropped; (3) Misfitting persons on Rasch
estimates were removed; and (4) Persons with large standardized differences in
performance on their pair of tests were removed. About 12% of the data were lost
through cleaning. After data cleaning, the item response strings were linked
into one giant calibration matrix such that strings for a person taking two
tests are aligned into the same row and responses to a given item fall into the
same column. This is diagrammed in Figure 2.

Tests are arranged from the lowest test levels of Form 7 and CPS90 to the
highest. This results in a Mathematics calibration matrix with 1031 different
items taken by 2995 different persons, and a Reading calibration matrix with 602
different items taken by 3159 persons.

Notice that these calibration matrices are only 15 percent filled with data.
Nevertheless, reliable equating was accomplished from Grade 1 through Grade 8.
Rasch equating does not need complete data to calibrate items successfully onto
a common scale or to obtain good estimates of person measures.
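The linking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(person_id, item_id, score)`, the function names, and the toy identifiers are hypothetical and do not come from the study. Unobserved person-item cells are stored as NaN, so the fill rate of the matrix can be computed directly.

```python
import numpy as np

def build_calibration_matrix(records):
    """Align scored responses into one person-by-item matrix.

    records: list of (person_id, item_id, score) tuples, score 0 or 1.
    Cells for items a person never took stay NaN (missing by design).
    """
    persons = sorted({p for p, _, _ in records})
    items = sorted({i for _, i, _ in records})
    row = {p: r for r, p in enumerate(persons)}
    col = {i: c for c, i in enumerate(items)}
    matrix = np.full((len(persons), len(items)), np.nan)
    for p, i, score in records:
        matrix[row[p], col[i]] = score  # same person -> same row, same item -> same column
    return matrix, persons, items

def fill_fraction(matrix):
    """Share of person-item cells that hold an observed response."""
    return float(np.mean(~np.isnan(matrix)))

# Toy data (hypothetical): person p1 took items from both forms, p2 from one form.
records = [
    ("p1", "F7_a", 1), ("p1", "F7_b", 0), ("p1", "C90_a", 1),
    ("p2", "C90_a", 0), ("p2", "C90_b", 1),
]
matrix, persons, items = build_calibration_matrix(records)
print(matrix.shape, fill_fraction(matrix))
```

In this toy case five of the eight cells are observed. The study's matrices are far sparser (about 15 percent filled), which is the situation the paragraph above says Rasch calibration tolerates.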



* Test strings were flagged when they failed evaluation under one or more of the following criteria:
(1) More than 3 multiples; (2) 50-70% [...] and > 1 embedded omits; (3) [...] and > 0 embedded omits.

^ We would like to thank the Chicago Public Schools for doing the first stage of cleaning by flagging
invalid response strings.
Each matrix was Rasch-analyzed in a one-step equating procedure and all tests
were placed on a common logit scale. Item calibrations are in difficulty logits,
the log odds of an item provoking failure from a person with ability equal to
the scale zero. We now have a bank of 1031 Mathematics items and another bank
of 602 Reading items. Fit statistics do not suggest the existence of dimensions
other than Mathematics and Reading in these two tests.
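The definition of a difficulty logit can be checked numerically. The sketch below uses the standard dichotomous Rasch model, in which the probability of success depends only on the difference between person ability and item difficulty. The function names are illustrative, and the difficulty value 1.9 is an arbitrary example.

```python
import math

def p_success(ability, difficulty):
    """Rasch probability that a person answers an item correctly (both in logits)."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def log_odds_failure(ability, difficulty):
    """Log odds that the item provokes a failure from the person."""
    p = p_success(ability, difficulty)
    return math.log((1.0 - p) / p)

# For a person at the scale zero, the log odds of failure equal the item difficulty.
example_difficulty = 1.9
print(log_odds_failure(0.0, example_difficulty))
```

A person whose ability equals the item difficulty has an even chance of success, which is why person measures and item calibrations can be read off the same scale.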
Figure 2 The Mathematics and Reading Calibration Matrices



Determining Person Measures Using the Mathematics and Reading Banks



(a) When response strings are available 



Persons responding to any of the ITBS test levels equated here can have their
abilities estimated from their responses by running a Rasch analysis on their
responses while anchoring the item difficulties on their bank values. Any set
of items can be selected from these banks to form a test targeted on a given
group of persons, and person abilities estimated in the same way. A realistic
standard error for each measure can be estimated, inflated for observed person
misfit. This is because Rasch estimates are based on perfect fit and the
standard errors for misfitting persons tend to be underestimated.
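A minimal sketch of anchored person estimation follows, assuming dichotomous items and a Newton-Raphson maximum likelihood search. The study does not say which estimation routine or which inflation rule was used. The rule here (multiply the model standard error by the square root of the information-weighted mean-square fit whenever that exceeds 1) is an assumption borrowed from a convention found in some Rasch software, and the item difficulties are made-up values.

```python
import math

def estimate_ability(responses, difficulties, tol=1e-6, max_iter=50):
    """Estimate one person's ability with item difficulties held at bank values.

    responses: list of 0/1 scores; difficulties: anchored logits, same order.
    Returns (ability, model_se, inflated_se).
    """
    raw, n = sum(responses), len(responses)
    if raw == 0 or raw == n:
        raise ValueError("zero and perfect scores have no finite estimate")
    theta = math.log(raw / (n - raw))  # starting value from the raw score
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
        info = sum(p * (1.0 - p) for p in probs)
        step = (raw - sum(probs)) / info  # Newton-Raphson update
        theta += step
        if abs(step) < tol:
            break
    probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
    info = sum(p * (1.0 - p) for p in probs)
    model_se = 1.0 / math.sqrt(info)
    # Assumed inflation rule: information-weighted mean-square fit, floored at 1.
    mean_square = sum((x - p) ** 2 for x, p in zip(responses, probs)) / info
    inflated_se = model_se * math.sqrt(max(1.0, mean_square))
    return theta, model_se, inflated_se

# Hypothetical four-item test with anchored difficulties in logits.
theta, model_se, inflated_se = estimate_ability([1, 1, 0, 0], [-1.0, 0.0, 1.0, 2.0])
print(round(theta, 3), round(model_se, 3), round(inflated_se, 3))
```

Because the difficulties are fixed, the estimate depends only on the person's raw score on that item set, while the misfit adjustment depends on which items were answered correctly.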

(b) When response strings are not available 

In longitudinal studies where tests were implemented years ago, response strings
are no longer available. The student measures therefore cannot be determined
from an analysis of their responses. An indirect method based on their recorded
grade-equivalents (GE's) must be used. The method is to regress the direct
person measures for each test level from the equating study on their GE's for
that test level. The person measures used were those of the individual test
analyses of uncleaned data, with item difficulties for this step anchored on
their bank values. The regression coefficients can then be used to predict
student ability measures from the GE's they obtained in their earlier tests.

Standard errors for these measures must also be estimated. Again regression
analysis was used. This time the dependent variables were the standard errors
(inflated for misfit) of the measures from the direct analyses of uncleaned data.
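The indirect method can be illustrated with an ordinary least-squares fit. The study does not state the functional form of its regressions, so a straight line is assumed here, and the GE and measure values below are invented toy numbers for a single test level.

```python
import numpy as np

def fit_ge_to_measure(ge_values, measures):
    """Least-squares line predicting logit measures from grade equivalents."""
    slope, intercept = np.polyfit(ge_values, measures, deg=1)
    return float(slope), float(intercept)

def predict_measure(ge_value, slope, intercept):
    """Predicted logit measure for a student with only a recorded GE."""
    return slope * ge_value + intercept

# Toy data (hypothetical): GE's and direct Rasch measures for one test level.
ge_values = [2.0, 3.0, 4.0]
measures = [-1.0, -0.2, 0.6]
slope, intercept = fit_ge_to_measure(ge_values, measures)
print(slope, intercept, predict_measure(3.5, slope, intercept))
```

The same fit, with the misfit-inflated standard errors as the dependent variable, would give predicted standard errors for the indirect measures.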

Mean Item Difficulty of Form 7 and CPS90 

Tables 2(a) and 2(b) show the mean item difficulty for each test level. The last
columns of Tables 2(a) and 2(b) show the differences between the mean logit
measures of CPS90 and Form 7. It is clear that CPS90 is slightly harder than
Form 7 at most test levels. Mean test difficulties were plotted against
grade in Figures 3 and 4 for Reading and Mathematics respectively. The
difference in difficulties between CPS90 and Form 7 [...] Level 9 [...]. To show
these differences more clearly, they were plotted against grade and are shown in
Figures 5 and 6.

Table 2 Mean Item Difficulty by Test Level (Mathematics)

                      Form 7                    CPS90
Test           No. of   Item              No. of   Item             Logit Diff.
Level  Grade   Items  Difficulty  S.D.    Items  Difficulty  S.D.    (CPS-F7)
  6      K       3?     -3.51     1.??      -        -        -         -
  7      1       81     -2.64     1.12      82     -2.82     1.17      0.02
  8      2       88     -1.??     1.38      ??     -1.48     1.38      0.29
  9      3       90     -1.??     1.82      88     -1.0?     1.29      0.01
 10      4       99     -0.09     1.11      95      0.07     1.20      0.18
 11      5      109      1.0?     0.88     101      0.98     1.??      0.09
 12      6      114      1.90     0.95     109      1.87     1.02     -0.03
 13      7      117      2.?1     0.97     113      2.13     0.95      0.22
 14      8      121      3.07     0.98     117      3.22     0.95      0.15

Note: ? marks a digit that is not legible in the source.

Notice from Tables 2(a) and 2(b) that the standard deviations of the item
calibrations for Reading increase with grade, that is, the Reading items become
[...]. The standard deviations of the item calibrations for Mathematics decrease
with grade, that is, the Mathematics items are closer together in difficulty
level at the higher test levels. This requires further investigation as to why
it is so.
Figure 3 Plot of Mean Reading Item Difficulties against Grade for Form 7
(Levels 6 through 14) and CPS90 (Levels 7 through 14).

Figure 4 Plot of Mean Mathematics Item Difficulties against Grade for Form 7
(Levels 6 through 14) and CPS90 (Levels 7 through 14).

Figure 5 Plot of Differences between Reading Item Difficulties of CPS90 and
Form 7 against Grade.

Figure 6 Plot of Differences between Mathematics Item Difficulties of CPS90
and Form 7 against Grade.

Mean Measures and Mean Grade Equivalents of Common Persons Taking Pairs of Tests

For the common persons taking a CPS90 and a Form 7 test at the same levels
(arrows 3, 7, 11, 12, and 13 in Figure 1), the mean measures and grade
equivalents were calculated. Results are shown in Tables 3 and 4 for Reading and
Mathematics.
Since the same persons took both tests, their matched mean measures on
the two tests should be statistically equivalent. This is shown graphically by
plotting the mean measures against grade in Figure 7 (for Reading). The same was
done for grade equivalents in Figure 8. Similar plots are shown for Mathematics
in Figures 9 and 10. Note that the matched mean GE's for the same persons are
not the same over the two test forms they took, for both Reading and
Mathematics. Students obtain higher grade equivalents from Form 7 for both
Reading and Mathematics, except for Grade 7 Reading. This shows a bias in
grade-equivalent equating of the ITBS, that is, GE's produced by the two forms
are not directly comparable. The GE plots are not even the straight lines we
expect from GE scoring. For Grade 7 Reading, the mean Rasch logit measure shows
that CPS90 is slightly harder than Form 7. In grade equivalents, however, the
same students appear to have done better on CPS90. This apparent contradiction
suggests the possibility that the norm group used for the CPS90 Grade 7 Reading
could have been a less able group compared to the norm group for Form 7. Hence
the same group of students in the equating study, when seen in terms of GE,
appear to have performed better on the CPS90 than on the Form 7 Grade 7 Reading.
When compared in logit measures for common persons, Form 7 and CPS90 differences
for all the grade levels are very close to zero, as expected. This is shown in
Figures 11 and 12.
[...] and (b) Mean GE's against Grade



The differences in mean GE's, adjusted to the logit scale using the average
exchange of 0.8 logits per grade so that the vertical scales are all comparable,
were also plotted against grade in Figures 11 and 12. Here the differences in
GE's between the two test forms are much larger than zero.
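The rescaling is a single multiplication, sketched below. Only the exchange rate of 0.8 logits per grade comes from the text. The function name and the two GE values are hypothetical.

```python
LOGITS_PER_GRADE = 0.8  # average exchange rate quoted in the text

def ge_difference_in_logits(ge_cps90, ge_form7, logits_per_grade=LOGITS_PER_GRADE):
    """Express a difference between two mean grade equivalents in logit units."""
    return (ge_cps90 - ge_form7) * logits_per_grade

# Hypothetical mean GE's for the same group of students on the two forms.
print(ge_difference_in_logits(4.5, 4.0))
```

Putting both kinds of difference on the logit scale is what lets the GE differences and the measure differences share one vertical axis.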
Standard Deviations of Measures and Grade Equivalents

From Tables 3 and 4 we see that for Reading and Mathematics, the standard
deviations of GE's increase with grade while those of measures do not. The
spread of students in logit measures does not change much from grade to grade.
The increasing standard deviations of the grade equivalents give the misleading
impression that student spread increases, that they get further apart. Figures
13(a) and 13(b) plot standard deviation against grade. Note the relative
constancy of the logit standard deviations and the systematic increase of the
GE's standard deviations across the grades. The illusion of increasing spread
produced by GE standard deviations could easily be misunderstood to prove that
schooling increases the differences among students. The logit measure plots show
that this is clearly not so.

Criterion Definition of Variable

Appendix A is an example of a criterion definition of the variable called
Mathematics Computation. D on the vertical axes is a linear transformation of
item calibrations, D = 26 + [?]*(item difficulty). The vertical axis on the right
shows the locations of the mean student ability at each grade.

Such item maps can readily be constructed once items have been calibrated, which
an item bank of this kind enables. The math items increase in complexity as the
difficulty level increases. This is useful to teachers. Students' measures are
directly comparable to item difficulty calibrations. Reference to an item map
such as this enables a teacher to determine what a student has or has not
mastered, where the student is in his mathematics education, and to plan his
lessons accordingly.

Conclusion 

The 17 levels of the ITBS Mathematics and Reading tests used in this study have
been successfully equated and are each on a common scale of item difficulty from
K to 8. A person taking either CPS90 or Form 7 (or any combination of items from
these two test forms targeted at his ability level) will obtain statistically
equivalent measures of his ability.

In the grade-equivalent metric, the difficulty of the test depends on the ability
level of the norming sample. A student's grade-equivalent depends on which test
form he takes. As a result it is impossible to compute student abilities by
studying the grade equivalents. Students scoring lower grade-equivalents on a
given test may be thought to be less able, when the test may actually be harder
or the norming sample more able. Similarly, students scoring higher
grade-equivalents may not necessarily be of higher ability since the test form
may in fact be easier or the norming sample less able. Using grade-equivalents
results in misleading interpretations of student performance. These have serious
policy implications. Teachers may recommend remedial programs for students who
do not actually need them. Students may be thought to have acquired the desired
level of competency when they have not. Funds may be channelled to the wrong
programs for the wrong students.

Students' rates of growth will never be shown by grade equivalents. Every year
they are forced to score one grade-equivalent unit higher. A plot of GE growth
against grade is forced close to a straight line, giving the false impression that
the rate of growth is uniform at all ages. With logit measures, however, rates
of growth are shown to be highest at the lower grades, and to decrease in the
higher grades.